Abstract. In this work we consider the problem of maximizing the
PageRank of a given target node in a graph by adding k new links. 
We consider the case that the new links must point to the given target 
node (backlinks). Previous work shows that this problem has no fully
polynomial time approximation scheme unless P = NP. We present a
polynomial time algorithm yielding a PageRank value within a constant 
factor from the optimal. We also consider the naive algorithm where we 
choose backlinks from nodes with high PageRank values compared to the 
outdegree and show that the naive algorithm performs much worse on 
certain graphs compared to the constant factor approximation scheme. 

1 Introduction 

Search engine optimization (SEO) is a fast growing industry that deals with
optimizing the ranking of web pages in search engine results. SEO is a complex task,
especially since the specific details of search and ranking algorithms are often
not publicly released and can change frequently. One of the key elements
of optimizing for search engine visibility is the "external link popularity", which
is based on the structure of the web graph. The problem of obtaining optimal
new backlinks in order to achieve good search engine rankings is known as Link
Building, and leading experts from the SEO industry consider Link Building to
be an important aspect of SEO [5].

The PageRank algorithm is one of the more popular methods of defining a 
ranking among nodes according to the link structure of the graph. The definition 
of PageRank [3] uses random walks based on the random surfer model. The
random surfer walk is defined as follows: the walk can start from any node in 
the graph and at each step the surfer chooses a new node to visit. The surfer 
"usually" chooses (uniformly at random) an outgoing link from the current node, 
and follows it. But with a small probability at each step the surfer may choose to 
ignore the current node's outgoing links, and just zap to any node in the graph 

* This work expands our previous work 'W with analysis of the proposed algorithm, 
showing a lower bound for the approximation ratio the algorithm can achieve. 



(chosen uniformly at random) . The purpose of zapping is to make all nodes in the 
graph reachable, and to make the start of the walk irrelevant (for the purposes of 
PageRank). The random surfer walk is a random walk on the graph with random 
restarts every few steps. This random walk has a unique stationary probability
distribution that assigns the probability value $\pi_i$ to node $i$. This value is the
PageRank of node $i$, and can be interpreted as the probability for the random
surfer of being at node $i$ at any given point during the walk. We refer to the
random restart as zapping. The parameter that controls the zapping frequency
is the probability of continuing the random walk at each step, $\alpha > 0$. The high
level idea is that the PageRank algorithm will assign high PageRank values to 
nodes that would appear more often in a random surfer type of walk. In other 
words, the nodes with high PageRank are hot-spots that will see more random 
surfer traffic, resulting directly from the link structure of the graph. If we add a 
small number of new links to the graph, the PageRank values of certain nodes 
can be affected very significantly. The Link Building problem arises as a natural 
question: given a specific target node in the graph, what is the best set of k links 
that will achieve the maximum increase for the PageRank of the target node? 

We consider the problem of choosing the optimal set of backlinks for maximizing
$\pi_x$, the PageRank value of some target node $x$. A backlink (with respect
to a target node $x$) is a link from any node towards $x$. Given a graph $G(V, E)$
and an integer $k$, we want to identify the $k \geq 1$ links to add to node $x$ in $G$ in
order to maximize the resulting PageRank of $x$, $\pi_x$. Intuitively, the new links
added should redirect the random surfer walk towards the target node as often 
as possible. For example, adding a new link from a node of very high PageRank 
would usually be a good choice. 

1.1 Related Work and Contribution 

The PageRank algorithm [3] is based on properties of Markov chains. There
are many results related to the computation of PageRank values [6,2] and
recalculating PageRank values after adding a set of new links in a graph [1].

The Link Building problem that we consider in this work is known to be
NP-hard [3], where it is even shown that there is no fully polynomial time
approximation scheme (FPTAS) for Link Building unless NP = P; the problem
is also shown to be W[1]-hard with parameter $k$. A related problem considers the
case where a target node aims at maximizing its PageRank by adding new
outlinks. Note that in this case, new outlinks can actually decrease the PageRank of
the target node. This is different from the case of the Link Building problem with
backlinks, where the PageRank of the target can only increase [1]. For the
problem of maximizing PageRank with outlinks we refer to [1,4], which contain, among
other things, guidelines for optimal linking structure.

In Sect. 2 we give background to the PageRank algorithm. In Sect. 3 we
formally introduce the Link Building problem. In Sect. 3.1 we consider the naive
and intuitively clear algorithm for Link Building where we choose backlinks
from the nodes with the highest PageRank values compared to their outdegree
(plus one). We show how to construct graphs where we obtain a surprisingly
high approximation ratio. The approximation ratio is the value of the optimal
solution divided by the value of the solution obtained by the algorithm. In Sect.
3.2 we present a polynomial time algorithm yielding a PageRank value within
a constant factor from the optimal and therefore show that the Link Building
problem is in the class APX.

2 Background: The PageRank Algorithm 

The PageRank algorithm was proposed by Brin and Page [3] as
a webpage ranking method that captures the importance of webpages. Loosely 
speaking, a link pointing to a webpage is considered a vote of trust for that 
webpage. A link from an important webpage is better for the receiver than a 
link from an unimportant webpage. 

We consider directed graphs $G = (V, E)$ that are unweighted and therefore
we count multiple links from a node $u$ to a node $v$ as a single link. The graph
may represent a set of webpages $V$ with hyperlinks between them, or any
other linked structure.

We define the following random surfer walk on $G$: at every step the random
surfer will choose a new node to visit. If the random surfer is currently visiting
node $u$ then the next node is chosen as follows: (1) with probability $\alpha$ the surfer
chooses an outlink from $u$, $(u, v)$, uniformly at random and visits $v$. If the current
node $u$ happens to be a sink (and therefore has no outlinks) then the surfer picks
any node $v \in V$ uniformly at random; (2) with probability $1 - \alpha$ the surfer visits
any node $v \in V$ chosen uniformly at random - this is referred to as zapping. A
typical value for the probability $\alpha$ is 0.85. The random surfer walk is therefore
a random walk that usually follows a random outlink, but every few steps it
essentially restarts the random walk from a random node in the graph.

Since the new node depends only on the current position in the graph, the
sequence of visited pages is a Markov chain with state space $V$ and transition
probabilities that can be defined as follows. Let $P = \{p_{ij}\}$ denote a matrix
derived from the adjacency matrix of the graph $G$, such that $p_{ij} = \frac{1}{outdeg(i)}$ if
$(i, j) \in E$ and $0$ otherwise ($outdeg(i)$ denotes the outdegree of $i$, the number of
out-going edges from node $i \in V$). If $outdeg(i) = 0$ then $p_{ij} = \frac{1}{n}$. The transition
probability matrix of the Markov chain that describes the random surfer walk
can therefore be written as $Q = \frac{1 - \alpha}{n} 1_{n,n} + \alpha P$, where $1_{n,n}$ is an $n \times n$ matrix
with every entry equal to 1.

This Markov chain is aperiodic and irreducible and therefore has a unique
stationary probability distribution $\pi$ - the eigenvector associated with the
dominant eigenvalue of $Q$. For any positive initial probability distribution $x_0$ over $V$,
the iteration $x_0^T Q^l$ will converge to the stationary probability distribution $\pi^T$
for large enough $l$. This is referred to as the power method |^.

The distribution $\pi = (\pi_1, \ldots, \pi_n)^T$ is defined as the PageRank vector of $G$.
The PageRank value of a node $u \in V$ is the expected fraction of visits to $u$
after $i$ steps for large $i$ regardless of the starting point. For example, a node that is
reachable from many other nodes in the graph via short directed paths will have a larger
PageRank.
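
The construction of $Q$ and the power method described above can be sketched as follows. This is a minimal illustration rather than a reference implementation: the function names, the use of NumPy, the convergence tolerance and the 4-node example graph are all assumptions made here.

```python
import numpy as np

def transition_matrix(n, edges, alpha=0.85):
    """Build Q = (1 - alpha)/n * 1_{n,n} + alpha * P for a directed graph."""
    P = np.zeros((n, n))
    for i, j in set(edges):          # multiple links count as a single link
        P[i, j] = 1.0
    outdeg = P.sum(axis=1)
    for i in range(n):
        if outdeg[i] == 0:
            P[i, :] = 1.0 / n        # a sink jumps to a uniformly random node
        else:
            P[i, :] /= outdeg[i]
    return (1.0 - alpha) / n * np.ones((n, n)) + alpha * P

def pagerank(n, edges, alpha=0.85, tol=1e-12, max_iter=10_000):
    """Power method: iterate x^T <- x^T Q starting from the uniform distribution."""
    Q = transition_matrix(n, edges, alpha)
    x = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        x_next = x @ Q
        if np.abs(x_next - x).sum() < tol:
            return x_next
        x = x_next
    return x

# Hypothetical 4-node example: node 3 is a sink with no incoming links.
edges = [(0, 1), (0, 2), (1, 2), (2, 0)]
pi = pagerank(4, edges)
```

In this example node 3 receives only zapping traffic, so it ends up with the smallest PageRank value.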

3 The Link Building Problem 

The k backlink (or Link Building) problem is defined as follows:
Definition 1. The LINK BUILDING problem:

- Instance: A triple $(G, x, k)$ where $G(V, E)$ is a directed graph, $x \in V$ and
$k \in \mathbb{Z}^+$.

- Solution: A set $S \subseteq V \setminus \{x\}$ with $|S| = k$ maximizing $\pi_x$ in $G(V, E \cup (S \times \{x\}))$.

For fixed $k = 1$ this problem can be solved in polynomial time by simply
calculating the new potential PageRanks of the target node after adding a link
from each node. This requires $O(n)$ PageRank calculations. The argument is
similar for any fixed $k$. As mentioned in Sect. 1.1, if $k$ is part of the input then
the problem becomes NP-hard.
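
For $k = 1$ the exhaustive search just described might look like the sketch below. It is only an illustration: the helper names and the example graph are hypothetical, and PageRank is computed here by solving the linear system $(I - \alpha P^T)\pi = \frac{1 - \alpha}{n} \mathbf{1}$, which is one of several equivalent ways to obtain $\pi$.

```python
import numpy as np

def pagerank(n, edges, alpha=0.85):
    """PageRank via the linear system (I - alpha P^T) pi = (1 - alpha)/n * 1."""
    P = np.zeros((n, n))
    for i, j in set(edges):
        P[i, j] = 1.0
    for i in range(n):
        s = P[i].sum()
        P[i] = P[i] / s if s > 0 else 1.0 / n   # sinks link to every node
    return np.linalg.solve(np.eye(n) - alpha * P.T, np.full(n, (1 - alpha) / n))

def best_single_backlink(n, edges, x, alpha=0.85):
    """Try every candidate source u and keep the one that maximizes pi_x."""
    existing = set(edges)
    best_u, best_val = None, pagerank(n, edges, alpha)[x]
    for u in range(n):
        if u == x or (u, x) in existing:
            continue
        val = pagerank(n, edges + [(u, x)], alpha)[x]   # one PageRank run per candidate
        if val > best_val:
            best_u, best_val = u, val
    return best_u, best_val

# Hypothetical example graph; node 4 is the target and has no backlinks yet.
edges = [(0, 1), (1, 2), (2, 0), (3, 0), (4, 3)]
u, value = best_single_backlink(5, edges, x=4)
```

The loop performs at most $n - 1$ PageRank computations, in line with the $O(n)$ count above; for fixed $k > 1$ the same idea would enumerate all $k$-subsets of candidates.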

3.1 Naive Selection of Backlinks 

When choosing new incoming links in a graph, based on the definition of the 
PageRank algorithm, higher PageRank nodes appear to be more desirable. If we 
naively assume that the PageRank values will not change after inserting new 
links to the target node then the optimal new sources for links to the target 
would be the nodes with the highest PageRank values compared to outdegree 
plus one. This leads us to the following naive but intuitively clear algorithm: 



Naive(G, x, k)

Compute all PageRanks $\pi_i$, for all $i \in V$ : $(i, x) \notin E$

Return the k webpages with highest values of $\frac{\pi_i}{d_i + 1}$, where $d_i$ is the outdegree of page $i$

Fig. 1. The naive algorithm



The algorithm simply computes all initial PageRanks and chooses the k nodes
with the highest value of $\frac{\pi_i}{d_i + 1}$. It is well understood [8] that the naive algorithm
is not always optimal. We will now show how to construct graphs with a
surprisingly high approximation ratio - roughly 13.8 for $\alpha = 0.85$ - for the naive
algorithm.
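
A direct transcription of the naive selection rule could look like the following sketch. The function names and the small example graph are assumptions made for illustration, and PageRank is computed with a linear solve rather than any particular method used by the authors.

```python
import numpy as np

def pagerank(n, edges, alpha=0.85):
    """PageRank via the linear system (I - alpha P^T) pi = (1 - alpha)/n * 1."""
    P = np.zeros((n, n))
    for i, j in set(edges):
        P[i, j] = 1.0
    for i in range(n):
        s = P[i].sum()
        P[i] = P[i] / s if s > 0 else 1.0 / n   # sinks link to every node
    return np.linalg.solve(np.eye(n) - alpha * P.T, np.full(n, (1 - alpha) / n))

def naive(n, edges, x, k, alpha=0.85):
    """Return the k nodes with the highest pi_i / (d_i + 1) among non-neighbours of x."""
    pi = pagerank(n, edges, alpha)      # PageRanks are computed once, before any link is added
    existing = set(edges)
    outdeg = np.zeros(n)
    for i, _ in existing:
        outdeg[i] += 1
    candidates = [i for i in range(n) if i != x and (i, x) not in existing]
    candidates.sort(key=lambda i: pi[i] / (outdeg[i] + 1), reverse=True)
    return candidates[:k]

# Hypothetical example: nodes 0 and 1 form a 2-cycle, node 2 feeds it, node 3 is the target.
edges = [(0, 1), (1, 0), (2, 0)]
chosen = naive(4, edges, x=3, k=2)
```

Here the two cycle nodes are selected, even though the second link from the cycle adds little new traffic once the first one exists; this is the weakness the lower-bound construction exploits.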



Lower Bound for the Approximation Ratio of the Naive Algorithm 

We define a family of input graphs ("cycle versus sink" graphs) that have the
following structure: There is a cycle with $k$ nodes, where each node has a number
of incoming links from $t_c$ other nodes (referred to as tail nodes). Tail nodes are
used to boost the PageRanks of certain pages in the input graph and have an
outdegree of 1. There are also $k$ sink nodes (no outlinks), each one with a tail of
$t_s$ nodes pointing to it. The target node is $x$ and it has outlinks towards all
of the sinks. Figure 2 illustrates this family of graphs. Assume also that there
is an isolated large clique with size $t_i$. The purpose of this clique is essentially
to absorb all the zapping traffic. Intuitively, this makes the linking structural
elements more important. Later we also give the bound without this clique, and
see that it is worse.




Fig. 2. A "cycle versus sink" graph for the naive algorithm. 



Due to symmetry all pages in the cycle will have the same PageRank $\pi_c$ and
the $k$ sink pages will have the PageRank $\pi_s$. All tail nodes have no incoming
links and, due to symmetry, will have the same PageRank denoted by $\pi_t$. The
PageRank of the target node is $\pi_x$ and the PageRank of each node in the isolated
clique is $\pi_i$.

The initial PageRanks for this kind of symmetric graph can be computed by
writing a linear system of equations based on the identity $\pi^T = \pi^T Q$. The total
number of nodes is $n = k(t_s + t_c + 2) + t_i + 1$.
We need to add $k$ new links towards the target node. We will pick the sizes
of the tails $t_c$ and $t_s$, and therefore the PageRanks in the initial network, so that
the PageRank (divided by outdegree plus one) of the cycle nodes is slightly
higher than the PageRank over outdegree plus one for the sink nodes. Therefore the naive
algorithm (Fig. 1) will choose to add $k$ links from the $k$ cycle nodes. Once one link has
been added, the rest of the cycle nodes are not desirable anymore, a fact that
the naive algorithm fails to observe. The optimal solution is to add $k$ links from
the sink nodes, as each node in a sense directs independent traffic to the target.

In order to make sure cycle nodes are chosen by the naive algorithm, we need
to ensure that $\frac{\pi_c}{outdeg(c) + 1} > \frac{\pi_s}{outdeg(s) + 1} \Leftrightarrow \frac{\pi_c}{2} > \pi_s \Leftrightarrow \pi_c / \pi_s = 2 + \delta$ for some
$\delta > 0$. We then parameterize our tails:
where $u$ determines the size of the graph and $A$ is the solution of $\pi_c / \pi_s = 2 + \delta$,
giving

$$A = \frac{((\alpha^2 - \alpha)\delta + 2\alpha^2)ku + 2((\alpha - 1)\delta + 2\alpha - 1)k + 2(\alpha^2 - \alpha)\delta + 4(\alpha^2 - \alpha)}{2\alpha^2 ku + ((2\alpha^2 - 2\alpha)\delta + 4\alpha^2 - 2\alpha)k + (2\alpha^3 - 2\alpha^2)\delta + 4\alpha^3 - 4\alpha^2}$$

We can solve for $A$ for any desired value of $\delta$. Note also that we choose the
size of the clique so that it asymptotically dominates
all the other tails. The naive algorithm therefore will add $k$ links from the cycle
nodes, which will result in the following linear system for the PageRanks:





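$$
\begin{aligned}
\pi_t^n &= \frac{1 - \alpha}{n} + \frac{\alpha k \pi_s^n}{n} \\
\pi_x^n &= \pi_t^n + \frac{\alpha k \pi_c^n}{2} \\
\pi_s^n &= \pi_t^n + \alpha \left( \frac{\pi_x^n}{k} + t_s \pi_t^n \right) \\
\pi_c^n &= \pi_t^n + \alpha \left( \frac{\pi_c^n}{2} + t_c \pi_t^n \right) \\
\pi_i^n &= \pi_t^n + \alpha \pi_i^n .
\end{aligned}
$$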



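The optimal solution is to choose $k$ links from the sink nodes, with a resulting
PageRank vector described by the following system:

$$
\begin{aligned}
\pi_t^o &= \frac{1 - \alpha}{n} \\
\pi_x^o &= \pi_t^o + \alpha k \pi_s^o \\
\pi_s^o &= \pi_t^o + \alpha \left( \frac{\pi_x^o}{k} + t_s \pi_t^o \right) .
\end{aligned}
$$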

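We solve these systems and calculate the approximation ratio of the naive
algorithm:

$$\frac{\pi_x^o}{\pi_x^n} = \frac{(\alpha^3 - 2\alpha^2) k t_s + (\alpha^2 - 2\alpha) k + \alpha - 2}{(\alpha^4 - \alpha^2) k t_c + (\alpha^3 - \alpha) k - \alpha^3 + 2\alpha^2 + \alpha - 2} \tag{4}$$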
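
A short sketch of how a ratio of this shape can be obtained. It rests on assumptions made here for illustration: in the optimal solution every sink gains a single outlink to $x$, in the naive solution every cycle node has outdegree 2, and the clique is large enough that the tail PageRanks in both solutions can be treated as one common value $\pi_t$. Eliminating $\pi_s^o$ and $\pi_c^n$ then gives

$$
\begin{aligned}
\pi_x^o &= \pi_t + \alpha k \left( \pi_t (1 + \alpha t_s) + \frac{\alpha \pi_x^o}{k} \right) \;\Rightarrow\; \pi_x^o = \pi_t \, \frac{1 + \alpha k + \alpha^2 k t_s}{1 - \alpha^2} , \\
\pi_c^n &= \pi_t (1 + \alpha t_c) + \frac{\alpha \pi_c^n}{2} \;\Rightarrow\; \pi_x^n = \pi_t + \frac{\alpha k \pi_c^n}{2} = \pi_t \, \frac{2 - \alpha + \alpha k + \alpha^2 k t_c}{2 - \alpha} , \\
\frac{\pi_x^o}{\pi_x^n} &= \frac{(2 - \alpha)(1 + \alpha k + \alpha^2 k t_s)}{(1 - \alpha^2)(2 - \alpha + \alpha k + \alpha^2 k t_c)} .
\end{aligned}
$$

Expanding both products and multiplying numerator and denominator by $-1$ yields a ratio of polynomials in $k t_s$, $k t_c$ and $k$ whose leading terms are $(\alpha^3 - 2\alpha^2) k t_s$ and $(\alpha^4 - \alpha^2) k t_c$.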

We now set our tails as described above in Equations (1)-(3) and let $u, k \to \infty$.
So for large values of the tail sizes we get the following limit:

$$\lim_{u,k \to \infty} \frac{\pi_x^o}{\pi_x^n} = \frac{2 - \alpha}{(\alpha^3 - \alpha^2 - \alpha + 1)\delta + 2\alpha^3 - 2\alpha^2 - 2\alpha + 2} \tag{5}$$

Now letting $\delta \to 0$ (as any positive value serves our purpose) we get the
following theorem.



Theorem 1. Consider the Link Building problem with target node $x$. Let $G = (V, E)$
be some directed graph. Let $\pi_x^o$ denote the highest possible PageRank that
the target node can achieve after adding $k$ links, and $\pi_x^n$ denote the PageRank
after adding the links returned by the naive algorithm from Fig. 1. Then for any
$\varepsilon > 0$ there exist infinitely many different graphs $G$ where

$$\frac{\pi_x^o}{\pi_x^n} > \frac{2 - \alpha}{2 (1 - \alpha)(1 - \alpha^2)} - \varepsilon . \tag{6}$$

Note that $\varepsilon$ can be written as a function of $u$, $\delta$, $k$ and $\alpha$. As $u, k \to \infty$, $\varepsilon \to 0$,
giving the asymptotic lower bound. For $\alpha = 0.85$ the lower bound is about 13.8.
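
For reference, the arithmetic behind the quoted value, assuming the asymptotic bound reads $\frac{2 - \alpha}{2(1 - \alpha)(1 - \alpha^2)}$ and taking $\alpha = 0.85$:

$$\frac{2 - \alpha}{2 (1 - \alpha)(1 - \alpha^2)} = \frac{1.15}{2 \cdot 0.15 \cdot 0.2775} = \frac{1.15}{0.08325} \approx 13.81 .$$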

To see that the large isolated clique is necessary, we follow the same
procedure as above but setting $t_i = 0$, which gives us the inferior bound

$$\frac{-2\alpha^4 - \alpha^2 - \alpha + 6}{4\alpha^3 - 6\alpha^2 - 4\alpha + 6}$$

which is only about 4.7 for $\alpha = 0.85$.



3.2 Link Building is in APX 



In this section we present a greedy polynomial time algorithm for Link Building,
computing a set of $k$ new backlinks to target node $x$ with a corresponding value
of $\pi_x$ within a constant factor from the optimal value. In other words we prove
that Link Building is a member of the complexity class APX. We also introduce
$z_{ij}$ as the expected number of visits of node $j$ starting at node $i$ without zapping
within the random surfer model. These values can be computed in polynomial
time [1].



Proof of APX Membership. Now consider the algorithm consisting of $k$ steps
where at each step we add a backlink to node $x$ producing the maximum increase
in $\frac{\pi_x}{z_{xx}}$ - the pseudo code of the algorithm is shown in Fig. 3. This algorithm runs
in polynomial time, producing a solution to the Link Building problem within a
constant factor from the optimal value as stated by the following theorem. So,
Link Building is a member of the complexity class APX.



r-Greedy(G, x, k)

S := ∅

repeat k times

    Let u be a node which maximizes the value of $\frac{\pi_x}{z_{xx}}$ in $G(V, E \cup \{(u, x)\})$
    S := S ∪ {u}
    E := E ∪ {(u, x)}

Report S as the solution



Fig. 3. Pseudo code for the greedy approach. 
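
The greedy loop can be sketched in code as below. This is an illustration under stated assumptions, not the authors' implementation: instead of evaluating $\frac{\pi_x}{z_{xx}}$ directly, it maximizes the sum over $i$ of $r_{ix}$ (the probability of reaching $x$ from $i$ before zapping), which is proportional to $\frac{\pi_x}{z_{xx}}$ by Proposition 2.1 of Avrachenkov and Litvak. The function names, the linear-system formulation of $r_{ix}$, the exclusion of nodes already linking to $x$ and the example graph are choices made here.

```python
import numpy as np

def link_matrix(n, edges):
    """Row-stochastic link-following matrix P; sinks link to every node."""
    P = np.zeros((n, n))
    for i, j in set(edges):
        P[i, j] = 1.0
    for i in range(n):
        s = P[i].sum()
        P[i] = P[i] / s if s > 0 else 1.0 / n
    return P

def reachability(n, edges, x, alpha=0.85):
    """Sum over i of r_ix, where r_ix = alpha * sum_j P_ij r_jx and r_xx = 1."""
    A = np.eye(n) - alpha * link_matrix(n, edges)
    A[x, :] = 0.0
    A[x, x] = 1.0                     # boundary condition r_xx = 1
    b = np.zeros(n)
    b[x] = 1.0
    return np.linalg.solve(A, b).sum()

def r_greedy(n, edges, x, k, alpha=0.85):
    """Add k backlinks to x, each time picking the source with the largest reachability."""
    edges = list(edges)
    S = []
    for _ in range(k):
        candidates = [u for u in range(n)
                      if u != x and u not in S and (u, x) not in edges]
        if not candidates:
            break
        u = max(candidates,
                key=lambda c: reachability(n, edges + [(c, x)], x, alpha))
        S.append(u)
        edges.append((u, x))
    return S

# Hypothetical example: nodes 0 and 1 form a 2-cycle, node 2 feeds it, node 3 is the target.
S = r_greedy(4, [(0, 1), (1, 0), (2, 0)], x=3, k=2)
```

Each step solves one linear system per candidate, so the whole procedure stays polynomial in $n$ and $k$.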



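Theorem 2. We let $\pi_x^G$ and $z_{xx}^G$ denote the values obtained by the r-Greedy
algorithm in Fig. 3. Denoting the optimal value by $\pi_x^o$, we have the following

$$\pi_x^G \geq \pi_x^o \, \frac{z_{xx}^G}{z_{xx}^o} \left(1 - \frac{1}{e}\right) \geq \pi_x^o \, (1 - \alpha^2) \left(1 - \frac{1}{e}\right)$$

where $e = 2.71828\ldots$ and $z_{xx}^o$ is the value of $z_{xx}$ corresponding to $\pi_x^o$.
Proof. Proposition 2.1 in [1] by Avrachenkov and Litvak states the following

$$\pi_x = \frac{1 - \alpha}{n} z_{xx} \left(1 + \sum_{i \neq x} r_{ix}\right) \tag{7}$$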

where $r_{ix}$ is the probability that a random surfer starting at $i$ reaches $x$ before
zapping. This means that the algorithm in Fig. 3 greedily adds backlinks to $x$ in
an attempt to maximize the probability of reaching node $x$ before zapping, for
a surfer dropped at a node chosen uniformly at random. We show in Lemma 1
below that $r_{ix}$ in the graph obtained by adding links from $X \subseteq V$ to $x$ is a
submodular function of $X$ - informally this means that adding the link $(u, x)$
early in the process produces a higher increase of $r_{ix}$ compared to adding the
link later. We also show in Lemma 2 below that $r_{ix}$ is not decreasing after
adding $(u, x)$, which is intuitively clear. We now conclude from (7) that $\frac{\pi_x}{z_{xx}}$ is a
submodular and nondecreasing function since $\frac{\pi_x}{z_{xx}}$ is a sum of submodular and
nondecreasing terms.

When we greedily maximize a nonnegative nondecreasing submodular function
we will always obtain a solution within a fraction $1 - \frac{1}{e}$ from the optimal
according to [7] by Nemhauser et al. We now have that:

$$\frac{\pi_x^G}{z_{xx}^G} \geq \frac{\pi_x^o}{z_{xx}^o} \left(1 - \frac{1}{e}\right) .$$

Finally, we use the fact that $z_{xx}^G$ and $z_{xx}^o$ are numbers between 1 and $\frac{1}{1 - \alpha^2}$.

For $\alpha = 0.85$ this gives an upper bound of $\frac{\pi_x^o}{\pi_x^G}$ of approximately 5.7. It must
be stressed that this upper bound is considerably smaller if $z_{xx}$ is close to the
optimal value prior to the modification - if $z_{xx}$ cannot be improved then the
upper bound is $\frac{e}{e - 1} \approx 1.58$. It may be the case that we obtain a bigger value of
$\pi_x$ by greedily maximizing $\pi_x$ instead of $\frac{\pi_x}{z_{xx}}$, but $\pi_x$ (the PageRank of the target
node throughout the Link Building process) is not a submodular function of $X$
so we cannot use the approach above to analyze this situation. To see that $\pi_x$
is not submodular we just have to observe that adding a backlink from a sink
node creating a short cycle late in the process will produce a higher increase in
$\pi_x$ compared to adding the link early in the process.
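
The two constants quoted above can be checked by hand, assuming the general guarantee has the form $\pi_x^G \geq \pi_x^o (1 - \alpha^2)(1 - \frac{1}{e})$ and taking $\alpha = 0.85$:

$$\frac{1}{(1 - \alpha^2)\left(1 - \frac{1}{e}\right)} \approx \frac{1}{0.2775 \cdot 0.6321} \approx 5.70 , \qquad \frac{1}{1 - \frac{1}{e}} = \frac{e}{e - 1} \approx 1.58 .$$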



Proof of Submodularity and Monotonicity of $r_{ix}$. Let $f_i(X)$ denote the
value of $r_{ix}$ in $G(V, E \cup (X \times \{x\}))$ - the graph obtained after adding links from
all nodes in $X$ to $x$.

Lemma 1. $f_i$ is submodular for every $i \in V$.

Proof. Let $f_i^r(X)$ denote the probability of reaching $x$ from $i$ without zapping,
in $r$ steps or less, in $G(V, E \cup (X \times \{x\}))$. We will show by induction in $r$ that
$f_i^r$ is submodular. We will show the following for arbitrary $A \subseteq B$ and $y \notin B$:

$$f_i^r(B \cup \{y\}) - f_i^r(B) \leq f_i^r(A \cup \{y\}) - f_i^r(A) . \tag{8}$$

We start with the induction basis $r = 1$. It is not hard to show that the two
sides of (8) are equal for $r = 1$. For the induction step: if you want to reach $x$
in $r + 1$ steps or less you have to follow one of the links to your neighbors and
reach $x$ in $r$ steps or less from the neighbor:

$$f_i^{r+1}(X) = \frac{\alpha}{outdeg(i)} \sum_{j : i \to j} f_j^r(X) \tag{9}$$

where $j : i \to j$ denotes the nodes that $i$ links to - this set includes $x$ if $i \in X$.
The outdegree of $i$ is also dependent on $X$. If $i$ is a sink in $G(V, E \cup (X \times \{x\}))$
then we can use (9) with $outdeg(i) = n$ and $\{j : i \to j\} = V$ - as explained in
Sect. 2 the sinks can be thought of as linking to all nodes in the graph. Please
also note that $f_x^r(X) = 1$.

We will now show that the following holds for every $i \in V$ assuming that (8)
holds for every $i \in V$:

$$f_i^{r+1}(B \cup \{y\}) - f_i^{r+1}(B) \leq f_i^{r+1}(A \cup \{y\}) - f_i^{r+1}(A) . \tag{10}$$

1. $i \in A$: The set $\{j : i \to j\}$ and $outdeg(i)$ are the same for all four terms in (10).
We use (9) and the induction hypothesis to see that (10) holds.

2. $i \in B \setminus A$:

(a) $i$ is a sink in $G(V, E)$: The left hand side of (10) is 0 while the right hand
side is positive or 0 according to Lemma 2 below.

(b) $i$ is not a sink in $G(V, E)$: In this case $\{j : i \to j\}$ includes $x$ on the left
hand side of (10) but not on the right hand side - the only difference
between the two sets - and $outdeg(i)$ is one bigger on the left hand side.
We now use the induction hypothesis and $\forall X : f_x^r(X) = 1$.

3. $i = y$: We rearrange (10) such that the two terms including $y$ are the only
terms on the left hand side. We now use the same approach as for the case
$i \in B \setminus A$.

4. $i \in V \setminus (B \cup \{y\})$: As the case $i \in A$.

Finally, we use $\lim_{r \to \infty} f_i^r(X) = f_i(X)$ to prove that (8) holds for $f_i$. □
Lemma 2. $f_i$ is nondecreasing for every $i \in V$.
Proof. We shall prove the following by induction in $r$ for $y \notin B$:

$$f_i^r(B \cup \{y\}) \geq f_i^r(B) . \tag{11}$$

We start with the induction basis $r = 1$.

1. $i = y$: The left hand side is $\frac{\alpha}{outdeg(y)}$ where $outdeg(y)$ is the new outdegree
of $y$ and the right hand side is at most $\frac{\alpha}{n}$ (if $y$ is a sink in $G(V, E)$).

2. $i \neq y$: The two sides are the same.

For the induction step, assume that (11) holds for $r$ and all $i$. We will show
that the following holds:

$$f_i^{r+1}(B \cup \{y\}) \geq f_i^{r+1}(B) . \tag{12}$$

1. $i = y$:

(a) $i$ is a sink in $G(V, E)$: The left hand side of (12) is $\alpha$ and the right hand
side is smaller than $\alpha$.

(b) $i$ is not a sink in $G(V, E)$: We use (9) in (12) and obtain simple averages
on both sides with bigger numbers on the left hand side due to the
induction hypothesis.

2. $i \neq y$: Again we can obtain averages where the numbers are bigger on the
left hand side due to the induction hypothesis.

Again we use $\lim_{r \to \infty} f_i^r(X) = f_i(X)$ to conclude that (11) holds for $f_i$. □
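
The two lemmas can also be checked numerically on a small instance. The sketch below iterates a recursion of the form $f_i^{r+1}(X) = \frac{\alpha}{outdeg(i)} \sum_{j : i \to j} f_j^r(X)$ with $f_x^r(X) = 1$, which is our reading of the induction step, and then tests the submodularity and monotonicity inequalities by brute force over all subsets. Function names, the tolerance, the number of steps and the example graph are assumptions made for illustration; such a check is of course no substitute for the proofs.

```python
from itertools import combinations
import numpy as np

def f_r(n, edges, x, X, steps, alpha=0.85):
    """f_i^r(X): probability of reaching x from i within `steps` link-following
    steps without zapping, after adding the links X x {x}."""
    links = set(edges) | {(u, x) for u in X}
    out = {i: [j for (a, j) in links if a == i] for i in range(n)}
    f = np.zeros(n)
    f[x] = 1.0
    for _ in range(steps):
        g = np.empty(n)
        for i in range(n):
            nbrs = out[i] or range(n)   # a sink behaves as if it links to all nodes
            g[i] = alpha * sum(f[j] for j in nbrs) / len(nbrs)
        g[x] = 1.0                      # being at x counts as having reached x
        f = g
    return f

def check_lemmas(n, edges, x, steps=50, tol=1e-12):
    """Brute-force test of monotonicity and submodularity for all A subset of B, y not in B."""
    others = [v for v in range(n) if v != x]
    subsets = [set(c) for r in range(len(others) + 1)
               for c in combinations(others, r)]
    for B in subsets:
        for y in set(others) - B:
            gain_B = f_r(n, edges, x, B | {y}, steps) - f_r(n, edges, x, B, steps)
            if (gain_B < -tol).any():
                return False            # some f_i decreased: monotonicity violated
            for A in subsets:
                if A <= B:
                    gain_A = f_r(n, edges, x, A | {y}, steps) - f_r(n, edges, x, A, steps)
                    if (gain_B > gain_A + tol).any():
                        return False    # later gain exceeds earlier gain: submodularity violated
    return True

# Hypothetical example: nodes 0 and 1 form a 2-cycle, node 2 feeds it, node 3 is the target.
ok = check_lemmas(4, [(0, 1), (1, 0), (2, 0)], x=3)
```

The enumeration is exponential in the number of nodes, so it is only meant for toy graphs.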



Fig. 4. Sink versus sink for the r-Greedy. Not shown is a large isolated clique of size $t_i$.



3.3 Lower Bound for r-Greedy 
Let

$$r_x = 1 + \sum_{i \neq x} r_{ix} \propto \frac{\pi_x}{z_{xx}}$$

be the reachability of node $x$. The r-Greedy algorithm above improves the
reachability of the target node by the maximal amount at each step. However, it misses
opportunities to improve $\pi_x$ by increasing the $z_{xx}$ value (short loops), and so it is
quite easy to show that there is an infinite graph family where the approximation
ratio is

$$\frac{\text{max possible } z_{xx}}{\text{min possible } z_{xx}} = \frac{1 / (1 - \alpha^2)}{1} = \frac{1}{1 - \alpha^2}$$

which is equal to about 3.6 for $\alpha = 0.85$. In order to force a greater approximation
ratio, we would have to consider graph families that use the independent set
aspect of link building, as discussed in [5].

Theorem 3. Consider the Link Building problem with target node $x$. Let $G = (V, E)$
be some directed graph. Let $\pi_x^o$ denote the highest possible PageRank that
the target node can achieve after adding $k$ links, and $\pi_x^G$ denote the PageRank
after adding the links returned by the r-Greedy algorithm above. Then for any
$\varepsilon > 0$ there exist infinitely many different graphs $G$ where

$$\frac{\pi_x^o}{\pi_x^G} > \frac{1}{1 - \alpha^2} - \varepsilon .$$

Proof. Consider the input given by Fig. 4 for some parameter $t_b$, and $t_c = t_b + 1$.
At the beginning, $r_x \approx 1$ since $x$ is unreachable, except by zapping, from any
other node in the graph (the large isolated clique ensures that this zapping
traffic is insignificant). Throughout the link building process, adding a link from
a shaded node will add $\alpha + t_b \alpha^2$ to $r_x$, while adding a link from a light node will
add $\alpha + (t_b + 1) \alpha^2$ (minus some insignificant zapping traffic that we lose due to
these nodes no longer being sinks). Hence, r-Greedy will always choose to add
links from the light nodes and ignore the possibility of creating short loops.



Now we analyse the effect of the links on $\pi_x$, as we have in Sect. 3.1. First we
consider the graph before any additional links are added. Again due to symmetry
the $k$ shaded nodes will have the PageRank $\pi_b$, and the light nodes $\pi_c$. All tail
nodes have no incoming links and, due to symmetry, will have the same PageRank
denoted by $\pi_t$. The PageRank of the target node is $\pi_x$ and the PageRank
of each node in the isolated clique is $\pi_i$.

Again we compute the initial PageRanks for this kind of symmetric graph
by writing a linear system of equations based on the identity $\pi^T = \pi^T Q$. The
total number of nodes here is $n = k(t_c + t_b + 2) + t_i + 1$.

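$$
\begin{aligned}
\pi_t &= \frac{1 - \alpha}{n} + \frac{k \alpha (\pi_b + \pi_c)}{n} \\
\pi_b &= \pi_t + \alpha \left( \frac{\pi_x}{k} + t_b \pi_t \right) \\
\pi_c &= \pi_t + \alpha t_c \pi_t \\
\pi_i &= \frac{\pi_t}{1 - \alpha}
\end{aligned}
$$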

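The greedy algorithm will choose $k$ links from the light nodes as shown above (if $t_c > t_b$),
giving:

$$
\begin{aligned}
\pi_t^G &= \frac{1 - \alpha}{n} + \frac{k \alpha \pi_b^G}{n} \\
\pi_x^G &= \pi_t^G + k \alpha \pi_c^G \\
\pi_b^G &= \pi_t^G + \alpha \left( \frac{\pi_x^G}{k} + t_b \pi_t^G \right) \\
\pi_c^G &= \pi_t^G + \alpha t_c \pi_t^G \\
\pi_i^G &= \frac{\pi_t^G}{1 - \alpha}
\end{aligned}
$$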

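The optimal solution is to choose $k$ links from the shaded nodes with a
resulting PageRank vector described by the following system:

$$
\begin{aligned}
\pi_t^o &= \frac{1 - \alpha}{n} + \frac{k \alpha \pi_c^o}{n} \\
\pi_x^o &= \pi_t^o + k \alpha \pi_b^o \\
\pi_b^o &= \pi_t^o + \alpha \left( \frac{\pi_x^o}{k} + t_b \pi_t^o \right) \\
\pi_c^o &= \pi_t^o + \alpha t_c \pi_t^o \\
\pi_i^o &= \frac{\pi_t^o}{1 - \alpha}
\end{aligned}
$$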

We now parameterize our variables simply thus:

$$t_b = c, \qquad t_c = c + 1, \qquad t_i = c^2 ,$$

so that a single variable determines the size of the graph. We solve these
systems and calculate the approximation ratio of the r-Greedy algorithm, giving an
expression with numerator

$$-\left(\alpha^2 c k + \alpha k + 1\right)\left((1 - \alpha^4)(c + 1)k + (1 - \alpha^2)ck + (-\alpha^3 - \alpha + 2)k + c^2 - \alpha^2 + 1\right)$$

and denominator

$$\left((-\alpha^3 - \alpha^2 + \alpha + 1)(c + 1)k + (\alpha + 1)ck + (-\alpha^2 + \alpha + 2)k + (\alpha + 1)c^2 + \alpha + 1\right)\left((\alpha^3 - \alpha^2)(c + 1)k + (\alpha^2 - \alpha)k + \alpha - 1\right) .$$

Letting the size of the graph go to infinity, we get the following limit. We
note that this limit is approached from the left.

$$\lim_{c \to \infty} \frac{\pi_x^o}{\pi_x^G} = \frac{1}{1 - \alpha^2}$$

And the theorem statement follows from this.

□ 

4 Discussion and Open Problems 

We have presented a constant-factor approximation polynomial time algorithm 
for Link Building. We also presented a lower bound for the approximation ratio 
achieved by a perhaps more intuitive and simpler greedy algorithm. The problem 
of developing a polynomial time approximation scheme (PTAS) for Link Building 
remains open. 